Distance correlation is a new class of multivariate dependence
coefficients applicable to random vectors of arbitrary and not neces- 
sarily equal dimension. Distance covariance and distance correlation 
are analogous to product-moment covariance and correlation, but 
generalize and extend these classical bivariate measures of depen- 
dence. Distance correlation characterizes independence: it is zero if 
and only if the random vectors are independent. The notion of co- 
variance with respect to a stochastic process is introduced, and it 
is shown that population distance covariance coincides with the co- 
variance with respect to Brownian motion; thus, both can be called 
Brownian distance covariance. In the bivariate case, Brownian covari- 
ance is the natural extension of product-moment covariance, as we 
obtain Pearson product-moment covariance by replacing the Brown- 
ian motion in the definition with identity. The corresponding statistic 
has an elegantly simple computing formula. Advantages of applying 
Brownian covariance and correlation vs the classical Pearson covari- 
ance and correlation are discussed and illustrated. 

1. Introduction. The importance of independence arises in diverse ap- 
plications, for inference and whenever it is essential to measure complicated 
dependence structures in bivariate or multivariate data. This paper focuses 
on a new dependence coefficient that measures all types of dependence be- 
tween random vectors X and Y in arbitrary dimension. Distance correlation 
and distance covariance (Szekely, Rizzo, and Bakirov [28]), and Brownian co- 
variance, introduced in this paper, provide a new approach to the problem of 
measuring dependence and testing the joint independence of random vectors
in arbitrary dimension. The corresponding statistics have simple computing 
formulae, apply to sample sizes n > 2 (not constrained by dimension), and 
do not require matrix inversion or estimation of parameters. For example, 
the distance covariance (dCov) statistic, derived in the next section, is the 
square root of 

\[
\mathcal{V}_n^2 = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl},
\]

where $A_{kl}$ and $B_{kl}$ are simple linear functions of the pairwise distances
between sample elements. It will be shown that the definitions of the new 
dependence coefficients have theoretical foundations based on characteristic 
functions and on the new concept of covariance with respect to Brownian 
motion. Our independence test statistics are consistent against all types of 
dependent alternatives with finite second moments. 

Classical Pearson product-moment correlation ($\rho$) and covariance measure
linear dependence between two random variables, and in the bivariate
normal case $\rho = 0$ is equivalent to independence. In the multivariate normal
case, a diagonal covariance matrix $\Sigma$ implies independence, but is not a
sufficient condition for independence in the general case. Nonlinear or
nonmonotone dependence may exist. Thus, $\rho$ or $\Sigma$ do not characterize
independence in general.

Although it does not characterize independence, classical correlation is 
widely applied in time series, clinical trials, longitudinal studies, modeling 
financial data, meta-analysis, model selection in parametric and nonpara- 
metric models, classification and pattern recognition, etc. Ratios and other 
methods of combining and applying correlation coefficients have also been 
proposed. An important example is maximal correlation, characterized by 
Renyi [22]. 

For multivariate inference, methods based on likelihood ratio tests (LRT) 
such as Wilks' Lambda [32] or Puri-Sen [20] are not applicable if dimen- 
sion exceeds sample size, or when distributional assumptions do not hold. 
Although methods based on ranks can be applied in some problems, many 
classical methods are effective only for testing linear or monotone types of 
dependence. 

There is much literature on testing or measuring independence. See, for 
example, Blomqvist [3], Blum, Kiefer, and Rosenblatt [4], or methods
outlined in Hollander and Wolfe [16] and Anderson [1]. Multivariate
nonparametric approaches to this problem can be found in Taskinen, Oja, and
Randles [30], and the references therein.

Our proposed distance correlation represents an entirely new approach. 
For all distributions with finite first moments, distance correlation
$\mathcal{R}$ generalizes the idea of correlation in at least two fundamental ways:

(i) $\mathcal{R}(X, Y)$ is defined for $X$ and $Y$ in arbitrary dimension.

(ii) $\mathcal{R}(X, Y) = 0$ characterizes independence of $X$ and $Y$.

The coefficient $\mathcal{R}(X, Y)$ is a standardized version of distance covariance
$\mathcal{V}(X, Y)$, defined in the next section. Distance correlation satisfies
$0 \le \mathcal{R} \le 1$, and $\mathcal{R} = 0$ only if $X$ and $Y$ are independent. In the
bivariate normal case, $\mathcal{R}$ is a deterministic function of $\rho$, and
$\mathcal{R}(X, Y) \le |\rho(X, Y)|$ with equality when $\rho = \pm 1$.

Thus, distance covariance and distance correlation provide a natural
extension of Pearson product-moment covariance $\sigma_{X,Y}$ and correlation $\rho$, and
new methodology for measuring dependence in all types of applications.

The notion of covariance of random vectors $(X, Y)$ with respect to a
stochastic process $U$ is introduced in this paper. This new notion $\operatorname{Cov}_U(X, Y)$
contains as distinct special cases distance covariance $\mathcal{V}^2(X, Y)$ and, for
bivariate $(X, Y)$, $\sigma^2_{X,Y}$. The title of this paper refers to $\operatorname{Cov}_W(X, Y)$, where
$W$ is a Wiener process.

Brownian covariance $\mathcal{W} = \mathcal{W}(X, Y)$ is based on Brownian motion or
Wiener process for random variables $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ with finite second
moments. An important property of Brownian covariance is that $\mathcal{W}(X, Y) = 0$
if and only if $X$ and $Y$ are independent.

A surprising result develops: the Brownian covariance is equal to the
distance covariance. This equivalence is not only surprising, it also shows
that distance covariance is a natural counterpart of product-moment
covariance. For bivariate $(X, Y)$, by considering the simplest nonrandom
function, identity (id), we obtain $\operatorname{Cov}^2_{\mathrm{id}}(X, Y) = \sigma^2_{X,Y}$. Then by considering
the most fundamental random process, Brownian motion $W$, we arrive
at $\operatorname{Cov}^2_W(X, Y) = \mathcal{V}^2(X, Y)$. Brownian correlation is a standardized
Brownian covariance, such that if Brownian motion is replaced with the identity
function, we obtain the absolute value of Pearson's correlation $\rho$.

A further advantage of extending Pearson correlation with distance
correlation is that while uncorrelatedness ($\rho = 0$) can sometimes replace
independence, for example, in proving some classical laws of large numbers,
uncorrelatedness is too weak to imply a central limit theorem, even for strongly
stationary summands (see Bradley [7-9]). On the other hand, a central limit
theorem for strongly stationary sequences of summands follows from $\mathcal{R} = 0$
type conditions (Szekely and Bakirov [25]).

Distance correlation and distance covariance are presented in Section 2. 
Brownian covariance is introduced in Section 3. Extensions and applications 
are discussed in Sections 4 and 5. 

2. Distance covariance and distance correlation. Let $X$ in $\mathbb{R}^p$ and $Y$
in $\mathbb{R}^q$ be random vectors, where $p$ and $q$ are positive integers. The lower
case $f_X$ and $f_Y$ will be used to denote the characteristic functions of $X$
and $Y$, respectively, and their joint characteristic function is denoted $f_{X,Y}$.
In terms of characteristic functions, $X$ and $Y$ are independent if and only
if $f_{X,Y} = f_X f_Y$. Thus, a natural approach to measuring the dependence
between $X$ and $Y$ is to find a suitable norm to measure the distance between
$f_{X,Y}$ and $f_X f_Y$.

Distance covariance $\mathcal{V}$ is a measure of the distance between $f_{X,Y}$ and the
product $f_X f_Y$. A norm $\|\cdot\|$ and a distance $\|f_{X,Y} - f_X f_Y\|$ are defined in
Section 2.2. Then an empirical version of $\mathcal{V}$ is developed and applied to test
the hypothesis of independence

\[
H_0 : f_{X,Y} = f_X f_Y \quad \text{vs} \quad H_1 : f_{X,Y} \neq f_X f_Y.
\]

In Szekely et al. [28] an omnibus test of independence based on the
sample distance covariance $\mathcal{V}_n$ is introduced that is easily implemented in
arbitrary dimension without requiring distributional assumptions. In Monte
Carlo studies, the distance covariance test exhibited superior power
relative to parametric or rank-based likelihood ratio tests against nonmonotone
types of dependence. It was also demonstrated that the tests were quite
competitive with the parametric likelihood ratio test when applied to
multivariate normal data. The practical message is that distance covariance tests
are powerful tests for all types of dependence.

2.1. Motivation.

Notation. The scalar product of vectors $t$ and $s$ is denoted by $\langle t, s \rangle$. For
complex-valued functions $f(\cdot)$, the complex conjugate of $f$ is denoted by $\bar{f}$
and $|f|^2 = f\bar{f}$. The Euclidean norm of $x$ in $\mathbb{R}^p$ is $|x|_p$. A primed variable
$X'$ is an independent copy of $X$; that is, $X$ and $X'$ are independent and
identically distributed (i.i.d.).

For complex functions $\gamma$ defined on $\mathbb{R}^p \times \mathbb{R}^q$, the $\|\cdot\|_w$-norm in the
weighted $L_2$ space of functions on $\mathbb{R}^{p+q}$ is defined by

\[
\|\gamma(t, s)\|_w^2 = \int_{\mathbb{R}^{p+q}} |\gamma(t, s)|^2 \, w(t, s) \, dt \, ds, \tag{2.1}
\]

where $w(t, s)$ is an arbitrary positive weight function for which the integral
above exists.

With a suitable choice of weight function $w(t, s)$, discussed below, we shall
define a measure of dependence

\[
\mathcal{V}^2(X, Y; w) = \|f_{X,Y}(t, s) - f_X(t) f_Y(s)\|_w^2
= \int_{\mathbb{R}^{p+q}} |f_{X,Y}(t, s) - f_X(t) f_Y(s)|^2 \, w(t, s) \, dt \, ds, \tag{2.2}
\]
which is analogous to classical covariance, but with the important property
that $\mathcal{V}^2(X, Y; w) = 0$ if and only if $X$ and $Y$ are independent. In what follows,
$w$ is chosen such that we can also define

\[
\mathcal{V}^2(X; w) = \mathcal{V}^2(X, X; w) = \|f_{X,X}(t, s) - f_X(t) f_X(s)\|_w^2
= \int_{\mathbb{R}^{2p}} |f_{X,X}(t, s) - f_X(t) f_X(s)|^2 \, w(t, s) \, dt \, ds,
\]

and similarly define $\mathcal{V}^2(Y; w)$. Then a standardized version of $\mathcal{V}(X, Y; w)$ is

\[
\mathcal{R}_w = \frac{\mathcal{V}(X, Y; w)}{\sqrt{\mathcal{V}(X; w)\, \mathcal{V}(Y; w)}},
\]

a type of unsigned correlation.

In the definition of the norm (2.1) there is more than one potentially
interesting and applicable choice of weight function $w$, but not every $w$
leads to a dependence measure that has desirable statistical properties. Let
us now discuss the motivation for our particular choice of weight function
leading to distance covariance.

At least two conditions should be satisfied by the standardized coefficient
$\mathcal{R}_w$:

(i) $\mathcal{R}_w \ge 0$ and $\mathcal{R}_w = 0$ only if independence holds.

(ii) $\mathcal{R}_w$ is scale invariant, that is, invariant with respect to
transformations $(X, Y) \mapsto (\epsilon X, \epsilon Y)$, for $\epsilon > 0$.

However, if we consider an integrable weight function $w(t, s)$, then for $X$ and
$Y$ with finite variance

\[
\lim_{\epsilon \to 0} \frac{\mathcal{V}^2(\epsilon X, \epsilon Y; w)}{\sqrt{\mathcal{V}^2(\epsilon X; w)\, \mathcal{V}^2(\epsilon Y; w)}} = \rho^2(X, Y).
\]

The above limit is obtained by considering the Taylor expansions of the
underlying characteristic functions. Thus, if the weight function is integrable,
$\mathcal{R}_w$ can be arbitrarily close to zero even if $X$ and $Y$ are dependent. By using
a suitable nonintegrable weight function, we can obtain an $\mathcal{R}_w$ that satisfies
both properties (i) and (ii) above.

Considering the operations on characteristic functions involved in eval- 
uating the integrand in (2.2), a promising solution to the choice of weight 
function w is suggested by the following lemma. 

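Lemma 1. If $0 < \alpha < 2$, then for all $x$ in $\mathbb{R}^d$

\[
\int_{\mathbb{R}^d} \frac{1 - \cos\langle t, x \rangle}{|t|_d^{d+\alpha}} \, dt = C(d, \alpha)\, |x|_d^{\alpha},
\]

where

\[
C(d, \alpha) = \frac{2 \pi^{d/2}\, \Gamma(1 - \alpha/2)}{\alpha\, 2^{\alpha}\, \Gamma((d + \alpha)/2)}
\]

and $\Gamma(\cdot)$ is the complete gamma function. The integrals at $0$ and $\infty$ are
meant in the principal value sense:
$\lim_{\epsilon \to 0} \int_{\mathbb{R}^d \setminus \{\epsilon B + \epsilon^{-1} B^c\}}$, where $B$ is the
unit ball (centered at 0) in $\mathbb{R}^d$ and $B^c$ is the complement of $B$.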

A proof of Lemma 1 is given in Szekely and Rizzo [27]. Lemma 1 suggests 
the weight functions 

\[
w(t, s; \alpha) = \bigl(C(p, \alpha)\, C(q, \alpha)\, |t|_p^{p+\alpha}\, |s|_q^{q+\alpha}\bigr)^{-1}, \qquad 0 < \alpha < 2. \tag{2.3}
\]

The weight functions (2.3) result in coefficients $\mathcal{R}_w$ that satisfy the scale
invariance property (ii) above.

In the simplest case corresponding to $\alpha = 1$ and Euclidean norm $|x|$,

\[
w(t, s) = \bigl(c_p\, c_q\, |t|_p^{1+p}\, |s|_q^{1+q}\bigr)^{-1}, \tag{2.4}
\]

where

\[
c_d = C(d, 1) = \frac{\pi^{(1+d)/2}}{\Gamma((1 + d)/2)}. \tag{2.5}
\]

(The constant $2c_d$ is the surface area of the unit sphere in $\mathbb{R}^{d+1}$.)
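
The constants above can be checked numerically. The following is a minimal sketch, assuming the form $C(d, \alpha) = 2\pi^{d/2}\Gamma(1 - \alpha/2) / (\alpha 2^{\alpha} \Gamma((d + \alpha)/2))$ from Lemma 1; the function names `C`, `c` and `sphere_area` are illustrative and are not taken from the paper.

```python
import math

def C(d, alpha):
    """Constant of Lemma 1 for dimension d and exponent 0 < alpha < 2."""
    return (2 * math.pi ** (d / 2) * math.gamma(1 - alpha / 2)
            / (alpha * 2 ** alpha * math.gamma((d + alpha) / 2)))

def c(d):
    """Closed form (2.5), the alpha = 1 case."""
    return math.pi ** ((1 + d) / 2) / math.gamma((1 + d) / 2)

def sphere_area(dim):
    """Surface area of the unit sphere in R^dim: 2 pi^(dim/2) / Gamma(dim/2)."""
    return 2 * math.pi ** (dim / 2) / math.gamma(dim / 2)

for d in range(1, 6):
    # C(d, 1) should agree with c_d, and 2 c_d with the sphere area in R^(d+1)
    print(d, math.isclose(C(d, 1.0), c(d)), math.isclose(2 * c(d), sphere_area(d + 1)))
```

For $d = 1$ this gives $c_1 = \pi$, so $2c_1 = 2\pi$ is the circumference of the unit circle in $\mathbb{R}^2$.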

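Remark 1. Lemma 1 is applied to evaluate the integrand in (2.2) for
weight functions (2.3) and (2.4). For example, if $\alpha = 1$ (2.4), then by Lemma
1 there exist constants $c_p$ and $c_q$ such that for $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^q$,

\[
\int_{\mathbb{R}^p} \frac{1 - \exp\{i\langle t, X \rangle\}}{|t|_p^{1+p}} \, dt = c_p |X|_p, \qquad
\int_{\mathbb{R}^q} \frac{1 - \exp\{i\langle s, Y \rangle\}}{|s|_q^{1+q}} \, ds = c_q |Y|_q,
\]

\[
\int_{\mathbb{R}^p} \int_{\mathbb{R}^q} \frac{1 - \exp\{i\langle t, X \rangle + i\langle s, Y \rangle\}}{|t|_p^{1+p}\, |s|_q^{1+q}} \, dt \, ds = c_p c_q |X|_p |Y|_q.
\]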

Distance covariance and distance correlation are a class of dependence 
coefficients and statistics obtained by applying a weight function of the type 
(2.3), $0 < \alpha < 2$. This type of weight function leads to a simple
product-average form of the covariance (2.8) analogous to Pearson covariance. Other
interesting weight functions could be considered (see, e.g., Bakirov, Rizzo 
and Szekely [2]), but only the weight functions (2.3) lead to distance covari- 
ance type statistics (2.8). 

In this paper we apply weight function (2.4) and the corresponding weighted
$L_2$ norm $\|\cdot\|$, omitting the index $w$, and write the dependence measure (2.2)
as $\mathcal{V}^2(X, Y)$. Section 4.1 extends our results for $\alpha \in (0, 2)$.

For finiteness of $\|f_{X,Y}(t, s) - f_X(t) f_Y(s)\|^2$, it is sufficient that $E|X|_p < \infty$
and $E|Y|_q < \infty$.



BROWNIAN COVARIANCE 






2.2. Definitions. 

Definition 1. The distance covariance (dCov) between random vectors
$X$ and $Y$ with finite first moments is the nonnegative number $\mathcal{V}(X, Y)$
defined by

\[
\mathcal{V}^2(X, Y) = \|f_{X,Y}(t, s) - f_X(t) f_Y(s)\|^2
= \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{X,Y}(t, s) - f_X(t) f_Y(s)|^2}{|t|_p^{1+p}\, |s|_q^{1+q}} \, dt \, ds. \tag{2.6}
\]

Similarly, distance variance (dVar) is defined as the square root of

\[
\mathcal{V}^2(X) = \mathcal{V}^2(X, X) = \|f_{X,X}(t, s) - f_X(t) f_X(s)\|^2.
\]

By definition of the norm $\|\cdot\|$, it is clear that $\mathcal{V}(X, Y) \ge 0$ and $\mathcal{V}(X, Y) = 0$
if and only if $X$ and $Y$ are independent.

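Definition 2. The distance correlation (dCor) between random vectors
$X$ and $Y$ with finite first moments is the nonnegative number $\mathcal{R}(X, Y)$
defined by

\[
\mathcal{R}^2(X, Y) =
\begin{cases}
\dfrac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X)\, \mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\, \mathcal{V}^2(Y) > 0; \\[2ex]
0, & \mathcal{V}^2(X)\, \mathcal{V}^2(Y) = 0.
\end{cases} \tag{2.7}
\]

Several properties of $\mathcal{R}$ analogous to $\rho$ are given in Theorem 3. Results
for the special case of bivariate normal $(X, Y)$ are given in Theorem 6.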

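The distance dependence statistics are defined as follows. For a random
sample $(\mathbf{X}, \mathbf{Y}) = \{(X_k, Y_k) : k = 1, \dots, n\}$ of $n$ i.i.d. random vectors $(X, Y)$
from the joint distribution of random vectors $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^q$, compute
the Euclidean distance matrices $(a_{kl}) = (|X_k - X_l|_p)$ and $(b_{kl}) = (|Y_k - Y_l|_q)$.
Define

\[
A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, \qquad k, l = 1, \dots, n,
\]

where

\[
\bar{a}_{k\cdot} = \frac{1}{n} \sum_{l=1}^{n} a_{kl}, \qquad
\bar{a}_{\cdot l} = \frac{1}{n} \sum_{k=1}^{n} a_{kl}, \qquad
\bar{a}_{\cdot\cdot} = \frac{1}{n^2} \sum_{k,l=1}^{n} a_{kl}.
\]

Similarly, define $B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot}$, for $k, l = 1, \dots, n$.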
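Definition 3. The nonnegative sample distance covariance $\mathcal{V}_n(\mathbf{X}, \mathbf{Y})$
and sample distance correlation $\mathcal{R}_n(\mathbf{X}, \mathbf{Y})$ are defined by

\[
\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl} \tag{2.8}
\]

and

\[
\mathcal{R}_n^2(\mathbf{X}, \mathbf{Y}) =
\begin{cases}
\dfrac{\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})}{\sqrt{\mathcal{V}_n^2(\mathbf{X})\, \mathcal{V}_n^2(\mathbf{Y})}}, & \mathcal{V}_n^2(\mathbf{X})\, \mathcal{V}_n^2(\mathbf{Y}) > 0; \\[2ex]
0, & \mathcal{V}_n^2(\mathbf{X})\, \mathcal{V}_n^2(\mathbf{Y}) = 0,
\end{cases} \tag{2.9}
\]

respectively, where the sample distance variance is defined by

\[
\mathcal{V}_n^2(\mathbf{X}) = \mathcal{V}_n^2(\mathbf{X}, \mathbf{X}) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl}^2. \tag{2.10}
\]

The nonnegativity of $\mathcal{V}_n^2$ and $\mathcal{R}_n^2$ may not be immediately obvious from
the definitions above, but this property as well as the motivation for the
definitions of the statistics will become clear from Theorem 1 below.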
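
The computing formulae of Definition 3 translate almost line by line into code. The sketch below is an illustration in plain Python (standard library only), not the implementation in the energy package; the helper names `distance_matrix`, `double_center`, `dcov2` and `dcor2` are hypothetical. Each observation is assumed to be a tuple of floats.

```python
import math

def distance_matrix(sample):
    """Pairwise Euclidean distances a_kl = |X_k - X_l|."""
    return [[math.dist(u, v) for v in sample] for u in sample]

def double_center(d):
    """A_kl = a_kl - (row mean k) - (column mean l) + (grand mean)."""
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[k][l] for k in range(n)) / n for l in range(n)]
    grand = sum(row) / n
    return [[d[k][l] - row[k] - col[l] + grand for l in range(n)] for k in range(n)]

def dcov2(A, B):
    """Squared sample distance covariance (2.8) from centered matrices."""
    n = len(A)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n**2

def dcor2(x, y):
    """Squared sample distance correlation (2.9); 0 if a sample is constant."""
    A = double_center(distance_matrix(x))
    B = double_center(distance_matrix(y))
    denom = dcov2(A, A) * dcov2(B, B)      # product of squared distance variances (2.10)
    return dcov2(A, B) / math.sqrt(denom) if denom > 0 else 0.0

x = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
y = [(u[0] ** 2,) for u in x]            # nonmonotone relation; Pearson covariance is 0
z = [(3.0 * u[0] + 1.0,) for u in x]     # exact linear relation
print(dcor2(x, y), dcor2(x, z))
```

In this toy sample the Pearson covariance of `x` and `y` vanishes by symmetry, while the sample distance correlation is strictly positive; for the linear relation `z` it equals 1 up to rounding.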



2.3. Properties of distance covariance. Several interesting properties of 
distance covariance are obtained. Results in this section are summarized as 
follows: 

(i) Equivalent definition of $\mathcal{V}_n$ in terms of empirical characteristic
functions and norm $\|\cdot\|$.

(ii) Almost sure convergence $\mathcal{V}_n \to \mathcal{V}$ and $\mathcal{R}_n^2 \to \mathcal{R}^2$.

(iii) Properties of $\mathcal{V}(X, Y)$, $\mathcal{V}(X)$, and $\mathcal{R}(X, Y)$.

(iv) Properties of $\mathcal{R}_n$ and $\mathcal{V}_n$.

(v) Weak convergence of $n\mathcal{V}_n^2$, the limit distribution of $n\mathcal{V}_n^2$, and
statistical consistency.

(vi) Results for the bivariate normal case.

Many of these results were obtained in Szekely et al. [28]. Here we give the 
proofs of new results and readers are referred to [28] for more details and 
proofs of our previous results. 

An equivalent definition of $\mathcal{V}_n$. The coefficient $\mathcal{V}(X, Y)$ is defined in
terms of characteristic functions, thus, a natural approach is to define the
statistic $\mathcal{V}_n(\mathbf{X}, \mathbf{Y})$ in terms of empirical characteristic functions. The joint
empirical characteristic function of the sample, $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$, is

\[
f^n_{X,Y}(t, s) = \frac{1}{n} \sum_{k=1}^{n} \exp\{i\langle t, X_k \rangle + i\langle s, Y_k \rangle\}.
\]

The marginal empirical characteristic functions of the $X$ sample and $Y$
sample are

\[
f^n_X(t) = \frac{1}{n} \sum_{k=1}^{n} \exp\{i\langle t, X_k \rangle\}, \qquad
f^n_Y(s) = \frac{1}{n} \sum_{k=1}^{n} \exp\{i\langle s, Y_k \rangle\},
\]

respectively. Then an empirical version of distance covariance could have
been defined as $\|f^n_{X,Y}(t, s) - f^n_X(t) f^n_Y(s)\|$, where the norm $\|\cdot\|$ is defined by
the integral as above in (2.1). Theorem 1 establishes that this definition is
equivalent to Definition 3.

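Theorem 1. If $(\mathbf{X}, \mathbf{Y})$ is a sample from the joint distribution of $(X, Y)$,
then

\[
\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = \|f^n_{X,Y}(t, s) - f^n_X(t) f^n_Y(s)\|^2.
\]

The proof applies Lemma 1 to evaluate the integral $\|f^n_{X,Y}(t, s) - f^n_X(t) f^n_Y(s)\|^2$
with $w(t, s) = \{c_p c_q |t|_p^{1+p} |s|_q^{1+q}\}^{-1}$. An intermediate result is

\[
\|f^n_{X,Y}(t, s) - f^n_X(t) f^n_Y(s)\|^2 = T_1 + T_2 - 2T_3, \tag{2.11}
\]

where

\[
T_1 = \frac{1}{n^2} \sum_{k,l=1}^{n} |X_k - X_l|_p\, |Y_k - Y_l|_q,
\]

\[
T_2 = \frac{1}{n^2} \sum_{k,l=1}^{n} |X_k - X_l|_p \; \frac{1}{n^2} \sum_{k,l=1}^{n} |Y_k - Y_l|_q,
\]

\[
T_3 = \frac{1}{n^3} \sum_{k=1}^{n} \sum_{l,m=1}^{n} |X_k - X_l|_p\, |Y_k - Y_m|_q.
\]

Then the algebraic identity $T_1 + T_2 - 2T_3 = \mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})$, where $\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})$ is
given by Definition 3, is established to complete the proof.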
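
The algebraic identity $T_1 + T_2 - 2T_3 = \mathcal{V}_n^2$ can be spot-checked numerically. The following sketch compares the two computing formulae on simulated data; the data-generating choices and the function names are assumptions made only for illustration.

```python
import math
import random

def dist(sample):
    """Pairwise Euclidean distance matrix of a list of tuples."""
    return [[math.dist(u, v) for v in sample] for u in sample]

def dcov2_centered(a, b):
    """V_n^2 via Definition 3: average of A_kl * B_kl."""
    n = len(a)
    def center(d):
        row = [sum(r) / n for r in d]
        grand = sum(row) / n
        # distance matrices are symmetric, so column means equal row means
        return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]
    A, B = center(a), center(b)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n**2

def dcov2_t_terms(a, b):
    """V_n^2 via (2.11): T1 + T2 - 2*T3."""
    n = len(a)
    t1 = sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n**2
    t2 = (sum(map(sum, a)) / n**2) * (sum(map(sum, b)) / n**2)
    t3 = sum(sum(a[k]) * sum(b[k]) for k in range(n)) / n**3
    return t1 + t2 - 2 * t3

random.seed(0)
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]   # p = 2
y = [(u[0] * u[1] + random.gauss(0, 1),) for u in x]                 # q = 1
a, b = dist(x), dist(y)
v1, v2 = dcov2_centered(a, b), dcov2_t_terms(a, b)
print(v1, v2, abs(v1 - v2))
```

The two values should agree up to floating-point rounding, and both are nonnegative as Theorem 1 implies.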

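As a corollary to Theorem 1, we have $\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) \ge 0$. It is also easy to
see that the statistic $\mathcal{V}_n(\mathbf{X}) = 0$ if and only if every sample observation is
identical. If $\mathcal{V}_n(\mathbf{X}) = 0$, then $A_{kl} = 0$ for $k, l = 1, \dots, n$. Thus,
$0 = A_{kk} = -\bar{a}_{k\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}$ implies that
$\bar{a}_{k\cdot} = \bar{a}_{\cdot k} = \bar{a}_{\cdot\cdot}/2$, and

\[
0 = A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot} = a_{kl} = |X_k - X_l|_p,
\]

so $X_1 = \cdots = X_n$.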



Remark 2. The simplicity of formula (2.8) for $\mathcal{V}_n$ in Definition 3 has
practical advantages. Although the identity (2.11) in Theorem 1 provides
an alternate computing formula for $\mathcal{V}_n$, the original formula in Definition 3
is simpler and requires less computing time (1/3 less time per statistic on
our current machine, for sample size 100). Reusable computations and other
efficiencies possible using the simpler formula (2.8) execute our permutation
tests in 94% to 98% less time, with the saving depending on the number of replicates.
It is straightforward to apply resampling procedures without the need to
recompute the distance matrices. See Example 5, where a jackknife procedure
is illustrated.
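
One way to see why formula (2.8) allows reuse in resampling: permuting the $Y$ sample only permutes the rows and columns of the centered matrix $(B_{kl})$, so both centered matrices can be computed once. The sketch below illustrates this idea under that assumption; it is not the energy package implementation, and `centered_distances`, `dcov_perm_test`, the replicate count and the simulated data are all hypothetical choices.

```python
import math
import random

def centered_distances(sample):
    """Double-centered Euclidean distance matrix of a list of tuples."""
    n = len(sample)
    d = [[math.dist(u, v) for v in sample] for u in sample]
    row = [sum(r) / n for r in d]          # symmetric matrix: row means = column means
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

def dcov_perm_test(x, y, replicates=199, seed=1):
    """Permutation test based on n * V_n^2; returns (statistic, p-value)."""
    n = len(x)
    A, B = centered_distances(x), centered_distances(y)   # computed once, then reused

    def stat(perm):
        # n * V_n^2 for the Y sample reordered by perm (rows and columns of B permuted)
        return sum(A[k][l] * B[perm[k]][perm[l]] for k in range(n) for l in range(n)) / n

    rng = random.Random(seed)
    idx = list(range(n))
    observed = stat(idx)
    exceed = 0
    for _ in range(replicates):
        rng.shuffle(idx)
        if stat(idx) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + replicates)

rng = random.Random(0)
x = [(rng.gauss(0, 1),) for _ in range(40)]
y = [(u[0] ** 2 + 0.1 * rng.gauss(0, 1),) for u in x]    # nonmonotone dependence
stat_value, p_value = dcov_perm_test(x, y)
print(stat_value, p_value)
```

The p-value uses the common $(1 + \#\{\text{replicates} \ge \text{observed}\})/(1 + R)$ convention, which is a choice made here rather than something specified in the text.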

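Theorem 2. If $E|X|_p < \infty$ and $E|Y|_q < \infty$, then almost surely

\[
\lim_{n \to \infty} \mathcal{V}_n(\mathbf{X}, \mathbf{Y}) = \mathcal{V}(X, Y).
\]

Corollary 1. If $E(|X|_p + |Y|_q) < \infty$, then almost surely

\[
\lim_{n \to \infty} \mathcal{R}_n^2(\mathbf{X}, \mathbf{Y}) = \mathcal{R}^2(X, Y).
\]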

Theorem 3. For random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ such that
$E(|X|_p + |Y|_q) < \infty$, the following properties hold:

(i) $0 \le \mathcal{R}(X, Y) \le 1$, and $\mathcal{R} = 0$ if and only if $X$ and $Y$ are independent.

(ii) $\mathcal{V}(a_1 + b_1 C_1 X, a_2 + b_2 C_2 Y) = \sqrt{|b_1 b_2|}\, \mathcal{V}(X, Y)$, for all constant
vectors $a_1 \in \mathbb{R}^p$, $a_2 \in \mathbb{R}^q$, scalars $b_1$, $b_2$ and orthonormal matrices $C_1$, $C_2$ in
$\mathbb{R}^p$ and $\mathbb{R}^q$, respectively.

(iii) If the random vector $(X_1, Y_1)$ is independent of the random vector
$(X_2, Y_2)$, then

\[
\mathcal{V}(X_1 + X_2, Y_1 + Y_2) \le \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).
\]

Equality holds if and only if $X_1$ and $Y_1$ are both constants, or $X_2$ and $Y_2$
are both constants, or $X_1$, $X_2$, $Y_1$, $Y_2$ are mutually independent.

(iv) $\mathcal{V}(X) = 0$ implies that $X = E[X]$, almost surely.

(v) $\mathcal{V}(a + bCX) = |b|\, \mathcal{V}(X)$, for all constant vectors $a$ in $\mathbb{R}^p$, scalars $b$,
and $p \times p$ orthonormal matrices $C$.

(vi) If $X$ and $Y$ are independent, then $\mathcal{V}(X + Y) \le \mathcal{V}(X) + \mathcal{V}(Y)$.
Equality holds if and only if one of the random vectors $X$ or $Y$ is constant.

Proofs of statements (iii) and (vi) are given in the Appendix.

Theorem 4.

(i) $\mathcal{V}_n(\mathbf{X}, \mathbf{Y}) \ge 0$.

(ii) $\mathcal{V}_n(\mathbf{X}) = 0$ if and only if every sample observation is identical.

(iii) $0 \le \mathcal{R}_n(\mathbf{X}, \mathbf{Y}) \le 1$.

(iv) $\mathcal{R}_n(\mathbf{X}, \mathbf{Y}) = 1$ implies that the dimensions of the linear subspaces
spanned by $\mathbf{X}$ and $\mathbf{Y}$ respectively are almost surely equal, and if we assume
that these subspaces are equal, then in this subspace

\[
\mathbf{Y} = a + b\mathbf{X}C
\]

for some vector $a$, nonzero real number $b$, and orthogonal matrix $C$.

Theorem 3 and the results below for the dCov test can be applied in a 
wide range of problems in statistical modeling and inference, including non- 
parametric models, models with multivariate response, or when dimension 
exceeds sample size. Some applications are discussed in Section 5. 

Asymptotic properties of $n\mathcal{V}_n^2$. A multivariate test of independence is
determined by $n\mathcal{V}_n^2$ or $n\mathcal{V}_n^2/T_2$, where $T_2 = \bar{a}_{\cdot\cdot}\bar{b}_{\cdot\cdot}$ is as defined in Theorem 1. If
we apply the latter version, it normalizes the statistic so that asymptotically
it has expected value 1. Then if $E(|X|_p + |Y|_q) < \infty$, under independence,
$n\mathcal{V}_n^2/T_2$ converges in distribution to a quadratic form

\[
Q \overset{D}{=} \sum_{j=1}^{\infty} \lambda_j Z_j^2, \tag{2.12}
\]

where $Z_j$ are independent standard normal random variables, $\{\lambda_j\}$ are
nonnegative constants that depend on the distribution of $(X, Y)$, and $E[Q] = 1$.
A test of independence that rejects independence for large $n\mathcal{V}_n^2/T_2$ (or $n\mathcal{V}_n^2$)
is statistically consistent against all alternatives with finite first moments.

In the next theorem we need only assume finiteness of first moments for
weak convergence of $n\mathcal{V}_n^2$ under the independence hypothesis.

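Theorem 5 (Weak convergence). If $X$ and $Y$ are independent and
$E(|X|_p + |Y|_q) < \infty$, then

\[
n\mathcal{V}_n^2 \xrightarrow[n \to \infty]{D} \|\zeta(t, s)\|^2,
\]

where $\zeta(\cdot)$ is a complex-valued zero mean Gaussian random process with
covariance function

\[
R(u, u_0) = \bigl(f_X(t - t_0) - f_X(t)\overline{f_X(t_0)}\bigr)\bigl(f_Y(s - s_0) - f_Y(s)\overline{f_Y(s_0)}\bigr),
\]

for $u = (t, s)$, $u_0 = (t_0, s_0) \in \mathbb{R}^p \times \mathbb{R}^q$.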

Corollary 2. If $E(|X|_p + |Y|_q) < \infty$, then

(i) If $X$ and $Y$ are independent, then $n\mathcal{V}_n^2/T_2 \xrightarrow[n \to \infty]{D} Q$ where $Q$ is a
nonnegative quadratic form of centered Gaussian random variables (2.12) and
$E[Q] = 1$.






G. J. SZEKELY AND M. L. RIZZO 



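(ii) If $X$ and $Y$ are independent, then $n\mathcal{V}_n^2 \xrightarrow[n \to \infty]{D} Q_1$ where $Q_1$ is a
nonnegative quadratic form of centered Gaussian random variables and
$E[Q_1] = E|X - X'|\, E|Y - Y'|$.

(iii) If $X$ and $Y$ are dependent, then $n\mathcal{V}_n^2/T_2 \xrightarrow[n \to \infty]{P} \infty$ and
$n\mathcal{V}_n^2 \xrightarrow[n \to \infty]{P} \infty$.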

Corollary 2(i), (ii) guarantees that the dCov test statistic has a proper 
limit distribution under the hypothesis of independence for all X and Y with 
finite first moments, while Corollary 2 (iii) shows that under any dependent 
alternative, the dCov test statistic tends to infinity (stochastically). Thus, 
the dCov test of independence is statistically consistent against all types of 
dependence. 

The dCov test is easy to implement as a permutation test, which is the
method that we applied in our examples and power comparisons. For the
permutation test implementation one can apply test statistic $n\mathcal{V}_n^2$. Large
values of $n\mathcal{V}_n^2$ (or $n\mathcal{V}_n^2/T_2$) are significant. The dCov test and test statistics
are implemented in the energy package for R in functions dcov.test, dcov,
and dcor [21, 23].

We have also obtained a result that gives an asymptotic critical value
applicable to arbitrary distributions. If $Q$ is a quadratic form of centered
Gaussian random variables and $E[Q] = 1$, then

\[
P\{Q \ge \chi^2_{1-\alpha}(1)\} \le \alpha
\]

for all $0 < \alpha \le 0.215$, where $\chi^2_{1-\alpha}(1)$ is the $(1 - \alpha)$ quantile of a chi-square
variable with 1 degree of freedom. This result follows from a theorem of
Szekely and Bakirov [26], page 181.

Thus, a test that rejects independence if $n\mathcal{V}_n^2/T_2 \ge \chi^2_{1-\alpha}(1)$ has an
asymptotic significance level at most $\alpha$. This test criterion could be quite
conservative for many distributions. Although this critical value is conservative,
it is a sharp bound; the upper bound $\alpha$ is achieved when $X$ and $Y$ are
independent Bernoulli variables.
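
The conservative criterion above needs only the statistic $n\mathcal{V}_n^2/T_2$ and one chi-square quantile. The sketch below is an illustration under stated assumptions: the function name `dcov_chisq_test` and the toy data are hypothetical, and the quantile is obtained from the standard fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so that $\chi^2_{1-\alpha}(1) = (z_{1-\alpha/2})^2$.

```python
import math
from statistics import NormalDist

def dcov_chisq_test(x, y, alpha=0.05):
    """Reject independence if n * V_n^2 / T2 >= chi-square(1) quantile at 1 - alpha."""
    if not 0 < alpha <= 0.215:
        raise ValueError("the bound is stated only for 0 < alpha <= 0.215")
    n = len(x)

    def distances(sample):
        return [[math.dist(u, v) for v in sample] for u in sample]

    def center(d):
        row = [sum(r) / n for r in d]
        grand = sum(row) / n
        return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

    a, b = distances(x), distances(y)
    A, B = center(a), center(b)
    v2 = sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n**2
    # T2 is the product of the grand means of the two distance matrices;
    # it is zero (and the ratio undefined) if either sample is constant
    t2 = (sum(map(sum, a)) / n**2) * (sum(map(sum, b)) / n**2)
    stat = n * v2 / t2
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return stat, crit, stat >= crit

x = [(-1 + 2 * k / 99,) for k in range(100)]   # equally spaced grid on [-1, 1]
y = [(u[0] ** 2,) for u in x]                  # deterministic nonmonotone relation
print(dcov_chisq_test(x, y))
```

Because the bound holds for arbitrary distributions, this criterion trades power for simplicity; the permutation test described earlier is the method applied in the paper's examples.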

Results for the bivariate normal distribution. When $(X, Y)$ has a
bivariate normal distribution, there is a deterministic relation between $\mathcal{R}$ and
$|\rho|$.

Theorem 6. If $X$ and $Y$ are standard normal, with correlation $\rho =
\rho(X, Y)$, then:

(i) $\mathcal{R}(X, Y) \le |\rho|$,

(ii) $\mathcal{R}^2(X, Y) = \dfrac{\rho \arcsin \rho + \sqrt{1 - \rho^2} - \rho \arcsin(\rho/2) - \sqrt{4 - \rho^2} + 1}{1 + \pi/3 - \sqrt{3}}$,

(iii) $\inf_{\rho \neq 0} \dfrac{\mathcal{R}(X, Y)}{|\rho|} = \lim_{\rho \to 0} \dfrac{\mathcal{R}(X, Y)}{|\rho|} = \dfrac{1}{2(1 + \pi/3 - \sqrt{3})^{1/2}} \approx 0.89066$.

The relation between $\mathcal{R}$ and $\rho$ for a bivariate normal distribution is shown
in Figure 1.
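
The closed form in Theorem 6(ii) is easy to evaluate directly. The following sketch tabulates $\mathcal{R}$ and the ratio $\mathcal{R}/|\rho|$ for a few values of $\rho$; the function name `dcor2_bivariate_normal` is an illustrative choice.

```python
import math

def dcor2_bivariate_normal(rho):
    """Squared distance correlation of a standard bivariate normal pair, Theorem 6(ii)."""
    num = (rho * math.asin(rho) + math.sqrt(1 - rho**2)
           - rho * math.asin(rho / 2) - math.sqrt(4 - rho**2) + 1)
    den = 1 + math.pi / 3 - math.sqrt(3)
    return num / den

for rho in (0.01, 0.1, 0.5, 0.9, 1.0):
    r = math.sqrt(dcor2_bivariate_normal(rho))
    print(rho, r, r / rho)
```

The ratio approaches the constant in (iii) as $\rho \to 0$ and increases to 1 at $\rho = 1$, in line with statement (i).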



3. Brownian covariance. To introduce the notion of Brownian
covariance, let us begin by considering the squared product-moment covariance.
Recall that a primed variable $X'$ denotes an i.i.d. copy of the unprimed
symbol $X$. For two real-valued random variables, the square of their classical
covariance is

\[
\bigl(E[(X - E(X))(Y - E(Y))]\bigr)^2
= E[(X - E(X))(X' - E(X'))(Y - E(Y))(Y' - E(Y'))]. \tag{3.1}
\]

Now we generalize the squared covariance and define the square of
conditional covariance, given two real-valued stochastic processes $U(\cdot)$ and $V(\cdot)$.
We obtain an interesting result when $U$ and $V$ are independent Wiener
processes.

First, to center the random variable $X$ in the conditional covariance, we
need the following definition. Let $X$ be a real-valued random variable and
$\{U(t) : t \in \mathbb{R}^1\}$ a real-valued stochastic process, independent of $X$. The
$U$-centered version of $X$ is defined by

\[
X_U = U(X) - \int_{-\infty}^{\infty} U(t) \, dF_X(t) = U(X) - E[U(X) \mid U], \tag{3.2}
\]

whenever the conditional expectation exists.

Fig. 1. Dependence coefficient $\mathcal{R}^2$ (solid line) and correlation $\rho^2$ (dashed line) in the
bivariate normal case.

Note that if id is identity, we have $X_{\mathrm{id}} = X - E[X]$. The important
examples in this paper apply Brownian motion/Wiener processes.

3.1. Definition of Brownian covariance. Let $W$ be a two-sided
one-dimensional Brownian motion/Wiener process with expectation zero and
covariance function

\[
|s| + |t| - |s - t| = 2\min(t, s), \qquad t, s \ge 0. \tag{3.3}
\]

This is twice the covariance of the standard Wiener process. Here the factor
2 simplifies the computations, so throughout the paper, covariance function
(3.3) is assumed for $W$.

Definition 4. The Brownian covariance or the Wiener covariance of
two real-valued random variables $X$ and $Y$ with finite second moments is a
nonnegative number defined by its square

\[
\mathcal{W}^2(X, Y) = \operatorname{Cov}^2_W(X, Y) = E[X_W X'_W Y_{W'} Y'_{W'}], \tag{3.4}
\]

where $(W, W')$ does not depend on $(X, Y, X', Y')$.

Note that if $W$ in $\operatorname{Cov}_W$ is replaced by the (nonrandom) identity
function id, then $\operatorname{Cov}_{\mathrm{id}}(X, Y) = |\operatorname{Cov}(X, Y)| = |\sigma_{X,Y}|$, the absolute value of
Pearson's product-moment covariance. While the standardized product-moment
covariance, Pearson correlation ($\rho$), measures the degree of linear
relationship between two real-valued variables, we shall see that standardized
Brownian covariance measures the degree of all kinds of possible relationships
between two real-valued random variables.

The definition of $\operatorname{Cov}_W(X, Y)$ can be extended to random processes in
higher dimensions as follows. If $X$ is an $\mathbb{R}^p$-valued random variable, and $U(s)$
is a random process (random field) defined for all $s \in \mathbb{R}^p$ and independent
of $X$, define the $U$-centered version of $X$ by

\[
X_U = U(X) - E[U(X) \mid U],
\]

whenever the conditional expectation exists.

Definition 5. If $X$ is an $\mathbb{R}^p$-valued random variable, $Y$ is an $\mathbb{R}^q$-valued
random variable, and $U(s)$ and $V(t)$ are arbitrary random processes (random
fields) defined for all $s \in \mathbb{R}^p$, $t \in \mathbb{R}^q$, then the $(U, V)$ covariance of $(X, Y)$ is
defined as the nonnegative number whose square is

\[
\operatorname{Cov}^2_{U,V}(X, Y) = E[X_U X'_U Y_V Y'_V], \tag{3.5}
\]

whenever the right-hand side is nonnegative and finite.
In particular, if $W$ and $W'$ are independent Brownian motions with
covariance function (3.3) on $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively, the Brownian covariance
of $X$ and $Y$ is defined by

\[
\mathcal{W}^2(X, Y) = \operatorname{Cov}^2_W(X, Y) = \operatorname{Cov}^2_{W,W'}(X, Y). \tag{3.6}
\]

Similarly, for random variables with finite variance define the Brownian
variance by

\[
\mathcal{W}(X) = \operatorname{Var}_W(X) = \operatorname{Cov}_W(X, X).
\]

Definition 6. The Brownian correlation is defined as

\[
\operatorname{Cor}_W(X, Y) = \frac{\mathcal{W}(X, Y)}{\sqrt{\mathcal{W}(X)\, \mathcal{W}(Y)}}
\]

whenever the denominator is not zero; otherwise $\operatorname{Cor}_W(X, Y) = 0$.

In the following sections we prove that $\operatorname{Cov}_W(X, Y)$ exists for random
vectors $X$ and $Y$ with finite second moments, and derive the Brownian
covariance in this case.

3.2. Existence of $\mathcal{W}(X, Y)$. In the following, the subscript on Euclidean
norm $|x|_d$ for $x \in \mathbb{R}^d$ is omitted when the dimension is self-evident.

Theorem 7. If $X$ is an $\mathbb{R}^p$-valued random variable, $Y$ is an $\mathbb{R}^q$-valued
random variable, and $E(|X|^2 + |Y|^2) < \infty$, then $E[X_W X'_W Y_{W'} Y'_{W'}]$ is
nonnegative and finite, and

\[
\begin{aligned}
\mathcal{W}^2(X, Y) &= E[X_W X'_W Y_{W'} Y'_{W'}] \\
&= E|X - X'||Y - Y'| + E|X - X'|\, E|Y - Y'| \\
&\quad - E|X - X'||Y - Y''| - E|X - X''||Y - Y'|,
\end{aligned} \tag{3.7}
\]

where $(X, Y)$, $(X', Y')$, and $(X'', Y'')$ are i.i.d.

Proof. Observe that

\[
\begin{aligned}
E[X_W X'_W Y_{W'} Y'_{W'}] &= E[E(X_W X'_W Y_{W'} Y'_{W'} \mid W, W')] \\
&= E[E(X_W Y_{W'} \mid W, W')\, E(X'_W Y'_{W'} \mid W, W')] \\
&= E[E(X_W Y_{W'} \mid W, W')]^2,
\end{aligned}
\]

and this is always nonnegative. For finiteness, it is enough to prove that all
factors in the definition of $\operatorname{Cov}_W(X, Y)$ have finite fourth moments. Equation
(3.7) relies on the special form of the covariance function (3.3) of $W$. The
remaining details are in the Appendix. □

See Section 4.1 for definitions and extension of results for the general
case of fractional Brownian motion with Hurst parameter $0 < H < 1$ and
covariance function $|t|^{2H} + |s|^{2H} - |t - s|^{2H}$.
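3.3. The surprising coincidence: $\mathcal{W} = \mathcal{V}$.

Theorem 8. For arbitrary $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$ with finite second moments

\[
\mathcal{W}(X, Y) = \mathcal{V}(X, Y).
\]

Proof. Both $\mathcal{V}$ and $\mathcal{W}$ are nonnegative, hence, it is enough to show
that their squares coincide. Lemma 1 can be applied to evaluate $\mathcal{V}^2(X, Y)$.
In the numerator of the integral we have terms like

\[
E[\cos\langle X - X', t \rangle \cos\langle Y - Y', s \rangle],
\]

where $X$, $X'$ are i.i.d. and $Y$, $Y'$ are i.i.d. Now apply the identity

\[
\cos u \cos v = 1 - (1 - \cos u) - (1 - \cos v) + (1 - \cos u)(1 - \cos v)
\]

and Lemma 1 to simplify the integrand. After cancelation in the numerator
of the integrand, there remains to evaluate integrals of the type

\[
\begin{aligned}
&\int_{\mathbb{R}^p} \int_{\mathbb{R}^q} \frac{E[1 - \cos\langle X - X', t \rangle]\, E[1 - \cos\langle Y - Y', s \rangle]}{|t|^{1+p}\, |s|^{1+q}} \, dt \, ds \\
&\qquad = \int_{\mathbb{R}^p} \frac{E[1 - \cos\langle X - X', t \rangle]}{|t|^{1+p}} \, dt \times \int_{\mathbb{R}^q} \frac{E[1 - \cos\langle Y - Y', s \rangle]}{|s|^{1+q}} \, ds \\
&\qquad = c_p c_q\, E|X - X'|\, E|Y - Y'|.
\end{aligned}
\]

Applying similar steps, after further simplification, we obtain

\[
\begin{aligned}
\mathcal{V}^2(X, Y) &= E|X - X'||Y - Y'| + E|X - X'|\, E|Y - Y'| \\
&\quad - E|X - X'||Y - Y''| - E|X - X''||Y - Y'|,
\end{aligned}
\]

and this is exactly equal to the expression (3.7) obtained for $\mathcal{W}^2(X, Y)$ in
Theorem 7. □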



As a corollary to Theorem 8, the properties of Brownian covariance for
random vectors $X$ and $Y$ with finite second moments are therefore the same
properties established for distance covariance $\mathcal{V}(X, Y)$ in Theorem 3.

The surprising result that Brownian covariance equals distance covariance
dCov, exactly as defined in (2.6) for $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, parallels a familiar
special case when $p = q = 1$. For bivariate $(X, Y)$ we found that $\mathcal{R}(X, Y)$ is
a natural counterpart of the absolute value of the Pearson correlation. That
is, if in (3.5) $U$ and $V$ are the simplest nonrandom function id, then we
obtain the square of Pearson covariance $\sigma^2_{X,Y}$. Next, if we consider the most
fundamental random processes, $U = W$ and $V = W'$, we obtain the square
of distance covariance, $\mathcal{V}^2(X, Y)$.

Interested readers are referred to Szekely and Bakirov [25] for the back- 
ground of the interesting coincidence in Theorem 8. 
4. Extensions.

4.1. The class of $\alpha$-distance dependence measures. In two contexts above
we have introduced dependence measures based on Euclidean distance and
on Brownian motion with Hurst index $H = 1/2$ (self-similarity index). Our
definitions and results can be extended to a one-parameter family of
distance dependence measures indexed by a positive exponent $0 < \alpha < 2$ on
Euclidean distance, or equivalently by an index $h$, where $h = 2H$ for Hurst
parameters $0 < H < 1$.

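If $E(|X|_p^{\alpha} + |Y|_q^{\alpha}) < \infty$ define $\mathcal{V}^{(\alpha)}$ by its square

\[
\mathcal{V}^{2(\alpha)}(X, Y) = \|f_{X,Y}(t, s) - f_X(t) f_Y(s)\|_{\alpha}^2
= \frac{1}{C(p, \alpha)\, C(q, \alpha)} \int_{\mathbb{R}^{p+q}} \frac{|f_{X,Y}(t, s) - f_X(t) f_Y(s)|^2}{|t|_p^{\alpha+p}\, |s|_q^{\alpha+q}} \, dt \, ds.
\]

Similarly, $\mathcal{R}^{(\alpha)}$ is the square root of

\[
\mathcal{R}^{2(\alpha)} = \frac{\mathcal{V}^{2(\alpha)}(X, Y)}{\sqrt{\mathcal{V}^{2(\alpha)}(X)\, \mathcal{V}^{2(\alpha)}(Y)}}, \qquad
0 < \mathcal{V}^{2(\alpha)}(X),\ \mathcal{V}^{2(\alpha)}(Y) < \infty,
\]

and $\mathcal{R}^{(\alpha)} = 0$ if $\mathcal{V}^{2(\alpha)}(X)\, \mathcal{V}^{2(\alpha)}(Y) = 0$.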

Now consider the Lévy fractional Brownian motion $\{W_H^d(t), t \in \mathbb{R}^d\}$ with
Hurst index $H \in (0, 1)$, which is a centered Gaussian random process with
covariance function

$$E[W_H^d(t)\,W_H^d(s)] = |t|^{2H} + |s|^{2H} - |t - s|^{2H}, \qquad t, s \in \mathbb{R}^d.$$

See Herbin and Merzbach [15]. 

In the following, $(W_H^p, W_{H^*}'^{\,q})$ and (X, X', Y, Y') are supposed to be
independent.

Using Lemma 1, it can be shown for Hurst parameters 0 < H, H* < 1,
h := 2H, and h* := 2H*, that

$$\begin{aligned}
\operatorname{Cov}^2_{W_H^p,\,W_{H^*}'^{\,q}}(X, Y)
&= \frac{1}{C(p,h)\,C(q,h^*)} \int_{\mathbb{R}^p}\int_{\mathbb{R}^q} \frac{|f(t,s) - f(t)g(s)|^2}{|t|_p^{p+h}\,|s|_q^{q+h^*}}\,dt\,ds \qquad \text{(4.1)}\\
&= E|X - X'|_p^h |Y - Y'|_q^{h^*} + E|X - X'|_p^h\,E|Y - Y'|_q^{h^*}\\
&\quad - E|X - X'|_p^h |Y - Y''|_q^{h^*} - E|X - X''|_p^h |Y - Y'|_q^{h^*}.
\end{aligned}$$

Here we need to suppose that $E|X|_p^{2h} < \infty$, $E|Y|_q^{2h^*} < \infty$. Observe that when
h = h* = 1, (4.1) is equation (3.7) of Theorem 7.

The corresponding statistics are defined by replacing the exponent 1 with
exponent α (or h) in the distance dependence statistics (2.8), (2.10), and
(2.9). That is, in the sample distance matrices replace $a_{kl} = |X_k - X_l|_p$ with
$a_{kl} = |X_k - X_l|_p^\alpha$, and replace $b_{kl} = |Y_k - Y_l|_q$ with $b_{kl} = |Y_k - Y_l|_q^\alpha$,
k, l = 1, ..., n.
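As a hedged illustration of this substitution, the following Python sketch computes the α-distance covariance and correlation statistics from the sample distance matrices. It assumes NumPy and the double-centered form of formula (2.8), $\mathcal{V}_n^2 = n^{-2}\sum_{k,l} A_{kl}B_{kl}$; the function names (`centered_distances`, `dcov2`, `dcor`) and the simulated data are illustrative assumptions and are not part of the energy package.

```python
import numpy as np

def centered_distances(z, alpha=1.0):
    """Double-centered matrix A_kl built from a_kl = |z_k - z_l|^alpha."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]                      # n observations in R^1
    diff = z[:, None, :] - z[None, :, :]
    a = np.sqrt((diff ** 2).sum(axis=-1)) ** alpha
    # subtract row means and column means, add back the grand mean
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def dcov2(x, y, alpha=1.0):
    """Squared sample alpha-distance covariance: the mean of A_kl * B_kl."""
    return float((centered_distances(x, alpha) * centered_distances(y, alpha)).mean())

def dcor(x, y, alpha=1.0):
    """Sample alpha-distance correlation; set to 0 if either variance term is 0."""
    denom = np.sqrt(dcov2(x, x, alpha) * dcov2(y, y, alpha))
    return float(np.sqrt(max(dcov2(x, y, alpha), 0.0) / denom)) if denom > 0 else 0.0

rng = np.random.default_rng(0)              # arbitrary seed, for reproducibility only
x = rng.normal(size=(60, 3))                # hypothetical sample in R^3
y = x[:, :2] ** 2 + 0.1 * rng.normal(size=(60, 2))   # dependent sample in R^2
for alpha in (0.5, 1.0, 1.5):
    print(alpha, round(dcor(x, y, alpha), 3))
```

Only the elementwise power applied to the distance matrix changes with α; the centering and averaging steps are the same as for α = 1.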

Theorem 2 can be generalized for $\|\cdot\|_\alpha$ norms, so that almost sure
convergence of $\mathcal{V}_n^{(\alpha)} \to \mathcal{V}^{(\alpha)}$ follows if the α-moments are finite. Similarly, one
can prove the weak convergence and statistical consistency for α exponents,
0 < α < 2, provided that α moments are finite.

Note that the strict inequality 0 < α < 2 is important. Although $\mathcal{V}^{(2)}$ can
be defined for α = 2, it does not characterize independence. Indeed, the
case α = 2 (squared Euclidean distance) leads to classical product-moment
correlation and covariance for bivariate (X, Y). Specifically, if p = q = 1,
then $\mathcal{R}^{(2)} = |\rho|$, $\mathcal{R}_n^{(2)} = |\hat\rho|$, and $\mathcal{V}_n^{(2)} = 2|\hat\sigma_{xy}|$, where $\hat\sigma_{xy}$ is the maximum
likelihood estimator of Pearson covariance $\sigma_{x,y} = \sigma(X, Y)$.
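The α = 2 identities can be checked numerically. The sketch below is a minimal illustration under the assumption that the statistic is computed from double-centered matrices of $|x_k - x_l|^\alpha$ as in formula (2.8); the helper names `dcov2_1d` and `dcor_1d` and the data are hypothetical. It also includes a small example of dependent but uncorrelated data, for which the α = 2 statistic vanishes while the α = 1 statistic does not.

```python
import numpy as np

def dcov2_1d(x, y, alpha):
    """Squared sample alpha-distance covariance for univariate samples."""
    def centered(z):
        a = np.abs(z[:, None] - z[None, :]) ** alpha
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    return float((centered(x) * centered(y)).mean())

def dcor_1d(x, y, alpha):
    """Sample alpha-distance correlation for univariate samples."""
    denom = np.sqrt(dcov2_1d(x, x, alpha) * dcov2_1d(y, y, alpha))
    return float(np.sqrt(max(dcov2_1d(x, y, alpha), 0.0) / denom))

rng = np.random.default_rng(1)              # arbitrary seed
x = rng.normal(size=40)
y = 0.5 * x + rng.normal(size=40)

pearson = np.corrcoef(x, y)[0, 1]
cov_mle = np.cov(x, y, bias=True)[0, 1]     # divides by n (maximum likelihood form)
print(dcor_1d(x, y, 2.0), abs(pearson))                 # agree up to rounding
print(np.sqrt(dcov2_1d(x, y, 2.0)), 2 * abs(cov_mle))   # agree up to rounding

# Dependent but uncorrelated data: alpha = 2 sees nothing, alpha = 1 does.
u = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v = u ** 2
print(dcor_1d(u, v, 2.0), dcor_1d(u, v, 1.0))
```

The agreement follows because double-centering $(x_k - x_l)^2$ gives $-2(x_k - \bar x)(x_l - \bar x)$, so the sum of products reduces to $4n^2\hat\sigma_{xy}^2$.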

4.2. Affine invariance. Independence is preserved under affine
transformations, hence it is natural to consider dependence measures that are affine
invariant. We have seen that $\mathcal{R}(X, Y)$ is invariant with respect to orthogonal
transformations

$$X \mapsto a_1 + b_1 C_1 X, \qquad Y \mapsto a_2 + b_2 C_2 Y, \qquad \text{(4.2)}$$

where $a_1, a_2$ are arbitrary vectors, $b_1, b_2$ are arbitrary nonzero numbers,
and $C_1, C_2$ are arbitrary orthogonal matrices. We can also define a distance
correlation that is affine invariant. Define the scaled samples $\mathbf{X}^*$ and $\mathbf{Y}^*$ by

$$\mathbf{X}^* = \mathbf{X} S_X^{-1/2}, \qquad \mathbf{Y}^* = \mathbf{Y} S_Y^{-1/2}, \qquad \text{(4.3)}$$

where $S_X$ and $S_Y$ are the sample covariance matrices of X and Y
respectively. The sample vectors in (4.3) are not invariant to affine transformations,
but the distances, $|X_k^* - X_l^*|$ and $|Y_k^* - Y_l^*|$, k, l = 1, ..., n, are invariant to
affine transformations. Thus, an affine distance correlation statistic can be
defined by its square

$$\mathcal{R}_n^{*2}(\mathbf{X}, \mathbf{Y}) = \frac{\mathcal{V}_n^2(\mathbf{X}^*, \mathbf{Y}^*)}{\sqrt{\mathcal{V}_n^2(\mathbf{X}^*)\,\mathcal{V}_n^2(\mathbf{Y}^*)}}.$$

Theoretical properties established for $\mathcal{V}_n$ and $\mathcal{R}_n$ also hold for $\mathcal{V}_n^*$ and
$\mathcal{R}_n^*$, because the transformation simply replaces the original weight function
$\{c_p c_q |t|_p^{1+p} |s|_q^{1+q}\}^{-1}$ with $\{c_p c_q |S_X^{1/2} t|_p^{1+p} |S_Y^{1/2} s|_q^{1+q}\}^{-1}$.
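A small numerical check of the invariance claim may be useful. The sketch below assumes NumPy, samples stored as n × p arrays with observations in rows, and a positive definite sample covariance matrix (so n must exceed the dimension); `scale_sample` and `affine_dcor` are illustrative names, and the symmetric inverse square root is obtained from an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation from double-centered Euclidean distance matrices."""
    def centered(z):
        a = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    A, B = centered(x), centered(y)
    return float(np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())))

def scale_sample(z):
    """Return z S^{-1/2}, with S the sample covariance matrix of the rows of z."""
    S = np.atleast_2d(np.cov(z, rowvar=False))
    w, v = np.linalg.eigh(S)                # S is symmetric; assumed positive definite
    return z @ (v @ np.diag(w ** -0.5) @ v.T)

def affine_dcor(x, y):
    """Distance correlation of the scaled samples in (4.3)."""
    return dcor(scale_sample(x), scale_sample(y))

rng = np.random.default_rng(2)              # arbitrary seed
x = rng.normal(size=(50, 3))
y = x[:, :2] ** 2 + rng.normal(size=(50, 2))

M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])             # an invertible, non-orthogonal matrix
x_affine = x @ M + np.array([1.0, -2.0, 3.0])
print(affine_dcor(x, y), affine_dcor(x_affine, y))   # equal up to rounding
print(dcor(x, y), dcor(x_affine, y))                 # not invariant in general
```

The scaled distances are Mahalanobis distances, $(x_k - x_l) S_X^{-1} (x_k - x_l)^\top$ under the square root, which is why the affine map cancels.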

4.3. Rank test. In the case of bivariate (X, Y) one can also consider
a distance covariance test of independence for rank(X), rank(Y), which
has the advantage that it is distribution free and invariant with respect
to monotone transformations of X and Y, but usually at a cost of lower
power than the dCov(X, Y) test (see Example 1). The rank-dCov test can be
applied to continuous or discrete data, but for discrete data it is necessary to
use the correct method for breaking ties. Any ties in ranks should be broken
randomly, so that a sample of size n is transformed to some permutation of
the integers 1:n. A table of critical values for the statistic $n\mathcal{R}_n^2$, based on
Monte Carlo results, is provided in Table 2 in the Appendix.
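The rank transformation with random tie-breaking can be sketched as follows. This is an illustrative Python version, not the authors' code: `random_tie_ranks` and `rank_dcov_statistic` are hypothetical names, and ties are broken by shuffling the indices before a stable sort, which is one way to obtain a uniformly random order among tied values.

```python
import numpy as np

def random_tie_ranks(z, rng):
    """Ranks 1..n of z, with ties broken uniformly at random."""
    z = np.asarray(z)
    shuffled = rng.permutation(len(z))                  # random order of the indices
    order = shuffled[np.argsort(z[shuffled], kind="stable")]
    ranks = np.empty(len(z))
    ranks[order] = np.arange(1, len(z) + 1)
    return ranks

def rank_dcov_statistic(x, y, rng):
    """n * R_n^2 computed on rank(X), rank(Y) for bivariate data."""
    def centered(r):
        a = np.abs(r[:, None] - r[None, :])
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    A, B = centered(random_tie_ranks(x, rng)), centered(random_tie_ranks(y, rng))
    r2 = (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())
    return len(x) * float(r2)

rng = np.random.default_rng(3)              # arbitrary seed
x = rng.normal(size=20)
y = x ** 2 + 0.1 * rng.normal(size=20)
stat = rank_dcov_statistic(x, y, rng)
print(stat)   # compare with the n = 20 row of Table 2 (4.25 at 10%, 5.20 at 5%)
```

For continuous data without ties the statistic is unchanged by strictly increasing transformations of either variable, since the ranks themselves are unchanged.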

5. Applications. 

5.1. Nonlinear and nonmonotone dependence. Suppose that one wants 
to test the independence of X and Y, where X and Y cannot be observed 
directly, but can only be measured with independent errors. Consider the 
following: 

(i) Suppose that $X_i$ can only be measured through observation of $A_i = X_i + \varepsilon_i$,
where $\varepsilon_i$ are independent of $X_i$, and similarly for $Y_i$.

(ii) One can only measure (non)random functions of X and Y, for
example, $A_i = \phi(X_i)$ and $B_i = \psi(Y_i)$.

(iii) Suppose both (i) and (ii) for certain types of random $\phi$ and $\psi$.

In all of these cases, even if (X, Y) were jointly normal, the dependence
between (A, B) can be such that the correlation of A and B is almost irrelevant,
but dCor(A, B) is obviously relevant.

In this section we illustrate a few of the many possible applications of 
distance covariance. The dCov test has been applied using the dcov.test 
function in the energy [23] package for R [21], where it is implemented as a 
permutation test. 
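For readers without R at hand, the idea of a permutation test based on $n\mathcal{V}_n^2$ can be sketched in Python. This is a minimal illustration under stated assumptions and not the implementation in the energy package: the function name `dcov_permutation_test`, the default of 199 replicates, and the p-value convention (1 + number of exceedances) / (1 + replicates) are choices made here.

```python
import numpy as np

def centered_distances(z):
    """Double-centered Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    a = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def dcov_permutation_test(x, y, replicates=199, seed=None):
    """Return (n * V_n^2, permutation p-value) for the hypothesis of independence."""
    rng = np.random.default_rng(seed)
    A, B = centered_distances(x), centered_distances(y)
    n = A.shape[0]
    observed = (A * B).sum() / n            # n * V_n^2 = (1/n) * sum_kl A_kl B_kl
    exceed = 0
    for _ in range(replicates):
        perm = rng.permutation(n)           # re-pair the y observations with the x observations
        permuted = (A * B[np.ix_(perm, perm)]).sum() / n
        exceed += permuted >= observed
    return float(observed), (1 + exceed) / (1 + replicates)

rng = np.random.default_rng(4)              # arbitrary seed
x = rng.normal(size=100)
stat, p_dep = dcov_permutation_test(x, x ** 2, seed=5)              # dependent pair
_, p_ind = dcov_permutation_test(x, rng.normal(size=100), seed=6)   # independent pair
print(stat, p_dep, p_ind)
```

Permuting the rows and columns of B together is equivalent to recomputing the centered distance matrix of the permuted y sample, so the distance matrices are computed only once.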

5.2. Examples. 

Example 1. This example is similar to the type considered in (ii), with 
observed data from the NIST Statistical Reference Datasets (NIST StRD) 
for Nonlinear Regression. The data analyzed is Eckerle4, data from an NIST 
study of circular interference transmittance [10]. There are 35 observations, 
the response variable is transmittance, and the predictor variable is
wavelength. A plot of the data in Figure 2(a) reveals that there is a nonlinear
relation between wavelength and transmittance. The proposed nonlinear model
is

$$y = \frac{\beta_1}{\beta_2} \exp\left\{-\frac{(x - \beta_3)^2}{2\beta_2^2}\right\} + \varepsilon,$$

where $\beta_1, \beta_2 > 0$, $\beta_3 \in \mathbb{R}$, and ε is random error. In the hypothesized model,
Y depends on the density of X.

Results of the dCov test of independence of wavelength and transmittance
are

    dCov test of independence
    data: x and y
    nV^2 = 8.1337, p-value = 0.021
    sample estimates:
         dCor
    0.4275431

with $\mathcal{R}_n = 0.43$, and dCov is significant (p-value = 0.021) based on 999
replicates. In contrast, neither Pearson correlation $\hat\rho = 0.0356$ (p-value = 0.839)
nor Spearman rank correlation $\hat\rho_s = 0.0062$ (p-value = 0.9718) detects the
nonlinear dependence between wavelength and transmittance, even though
the relation in Figure 2(a) appears to be nearly deterministic.

Fig. 2. The Eckerle4 data (a) and plot of residuals vs predictor variable for the NIST
certified estimates (b), in Example 1.

The certified estimates (best solution found) for the parameters are
reported by NIST as $\hat\beta_1 = 1.55438$, $\hat\beta_2 = 4.08883$, and $\hat\beta_3 = 451.541$. The
residuals of the fitted model are easiest to analyze when plotted vs the predictor
variable as in Figure 2(b). Comparing residuals and transmittance,

    dCov test of independence
    data: y and res
    nV^2 = 0.0019, p-value = 0.019
    sample estimates:
         dCor
    0.4285534

we have $\mathcal{R}_n = 0.43$ and the dCov test is significant (p-value = 0.019) based
on 999 replicates. Again the Pearson correlation is nonsignificant ($\hat\rho = 0.11$,
p-value = 0.5378).
Although nonlinear dependence is clearly evident in both plots, note that
the methodology applies to multivariate analysis as well, for which residual 
plots are much less informative. 

Example 2. In the model specification of Example 1, the response vari- 
able Y is assumed to be proportional to a normal density plus random error. 
For simplicity, consider $(X, Y) = (X, \phi(X))$, where X is standard normal and
$\phi(\cdot)$ is the standard normal density. Results of a Monte Carlo power
comparison of the dCov test with classical Pearson correlation and Spearman
rank tests are shown in Figure 3. The power estimates are computed as the 
proportion of significant tests out of 10,000 at 10% significance level. 

In this example, where the relation between X and Y is deterministic but 
not monotone, it is clear that the dCov test is superior to product moment 
correlation tests. Statistical consistency of the dCov test is evident, as its 
power increases to 1 with sample size, while the power of correlation tests 
against this alternative remains approximately level across sample sizes. We 
also note that distance correlation applied to ranks of the data is more 
powerful in this example than either correlation test, although somewhat 
less powerful than the dCov test on the original (X, Y) data. 

Fig. 3. Example 2: Empirical power at 0.1 significance and sample size n.

Example 3. The Saviotti aircraft data [24] record six characteristics of
aircraft designs which appeared during the twentieth century. We consider
two variables, wing span (m) and speed (km/h) for the 230 designs of the
third (of three) periods. This example and the data (aircraft) are from
Bowman and Azzalini [5, 6]. A scatterplot on log-log scale of the variables and
contours of a nonparametric density estimate are shown in Figures 4(a) and
4(b). The nonlinear relation between speed and wing span is quite evident
from the plots.

The dCov test of independence of log(Speed) and log(Span) in period 3
is significant (p-value = 0.001), while the Pearson correlation test is not
significant (p-value = 0.8001).

    dCov test of independence
    data: logSpeed3 and logSpan3
    nV^2 = 3.4151, p-value = 0.001
    sample estimates:
         dCor
    0.2804530

    Pearson's product-moment correlation
    data: logSpeed3 and logSpan3
    t = 0.2535, df = 228, p-value = 0.8001
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.1128179  0.1458274
    sample estimates:
           cor
    0.01678556




Fig. 4. Scatterplot (a) and contours of density estimate (b) for the aircraft speed and span
variables, period 3, in Example 3. Horizontal axis in both panels: log(Span).

The sample estimates are $\hat\rho = 0.0168$ and $\mathcal{R}_n = 0.2805$. Here we have an
example of observed data where two variables are nearly uncorrelated, but
dependent. We obtained essentially the same results on the correlations of
ranks of the data.

Example 4. This example compares dCor and Pearson correlation in 
exploratory data analysis. Consider the Freedman [13, 31] data on crime 
rates in US metropolitan areas with 1968 populations of 250,000 or more. 
The data set is available from Fox [12], and contains four numeric variables: 

population (total 1968, in thousands), 
nonwhite (percent nonwhite population, 1960), 
density (population per square mile, 1968), 
crime (crime rate per 100,000, 1969). 

The 110 observations contain missing values. The data analyzed are the
100 cities with complete data. Pearson $\hat\rho$ and dCor statistics $\mathcal{R}_n$ are shown in
Table 1. Note that there is a significant association between crime and
population density measured by dCor, which is not significant when measured
by $\hat\rho$.

Analysis of this data continues in Example 5. 

Example 5 (Influential observations). When $\mathcal{V}_n$ and $\mathcal{R}_n$ are computed
using formula (2.8), it is straightforward to apply a jackknife procedure to
identify possible influential observations or to estimate standard error of $\mathcal{V}_n$
or $\mathcal{R}_n$. A 'leave-one-out' sample corresponds to (n − 1) × (n − 1) matrices
$A_{(i)kl}$ and $B_{(i)kl}$, where the subscript (i) indicates that the ith observation is
left out. Then $A_{(i)kl}$ is computed from distance matrix $A = (a_{kl})$ by omitting
the ith row and the ith column of A, and similarly $B_{(i)kl}$ is computed from
$B = (b_{kl})$ by omitting the ith row and the ith column of B. Then

$$\mathcal{V}^2_{(i)}(\mathbf{X}, \mathbf{Y}) = \frac{1}{(n-1)^2} \sum_{k,l \neq i} A_{(i)kl} B_{(i)kl}, \qquad i = 1, \ldots, n,$$

are the jackknife replicates of $\mathcal{V}_n^2$, obtained without recomputing matrices
A and B. Similarly, $\mathcal{R}^2_{(i)}$ can be computed from the matrices A and B. A
jackknife estimate of the standard error of $\mathcal{R}_n$ is thus easily obtained from
the matrices A, B (on the jackknife, see, e.g., Efron and Tibshirani [11]).

Table 1

Pearson correlation and distance correlation statistics for the Freedman data of Example
4. Significance at 0.05, 0.01, 0.001 for the corresponding tests is indicated by *, **, ***,
respectively

                         Pearson                            dCor
              Nonwhite   Density    Crime      Nonwhite   Density    Crime
Population    0.070      0.368***   0.396***   0.260*     0.615***   0.422**
Nonwhite                 0.002      0.294**               0.194      0.385***
Density                             0.112                            0.250*

The jackknife replicates $\mathcal{R}_{(i)}$ can be used to identify potentially influential
observations, in the sense that outliers within the sample of replicates
correspond to observations that increase or decrease the dependence coefficient
more than other observations. These unusual replicates are not necessarily
outliers in the original data.
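The leave-one-out computation can be sketched as follows. This Python illustration reads the description above as: compute the distance matrices once, then for each i drop the ith row and column and re-center the remaining (n − 1) × (n − 1) block. That reading, the function names, the simulated data, and the use of the standard jackknife standard error formula $\sqrt{\tfrac{n-1}{n}\sum_i (\mathcal{R}_{(i)} - \bar{\mathcal{R}}_{(\cdot)})^2}$ are assumptions of this sketch.

```python
import numpy as np

def distances(z):
    """Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    return np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))

def dcor_from_distances(a, b):
    """Distance correlation computed from two (uncentered) distance matrices."""
    def center(d):
        return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    A, B = center(a), center(b)
    return float(np.sqrt(max((A * B).mean(), 0.0) / np.sqrt((A * A).mean() * (B * B).mean())))

def jackknife_dcor(x, y):
    """Leave-one-out replicates R_(i) and the jackknife standard error of R_n."""
    a, b = distances(x), distances(y)       # computed once from the full sample
    n = a.shape[0]
    reps = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)   # omit the ith row and column
        reps[i] = dcor_from_distances(a[np.ix_(keep, keep)], b[np.ix_(keep, keep)])
    se = np.sqrt((n - 1) / n * ((reps - reps.mean()) ** 2).sum())
    return reps, float(se)

rng = np.random.default_rng(7)              # arbitrary seed
x = rng.normal(size=(30, 3))                # hypothetical multivariate predictor
y = x[:, 0] ** 2 + rng.normal(size=30)
reps, se = jackknife_dcor(x, y)
print(se, int(np.argmax(np.abs(reps - reps.mean()))))   # SE and the most extreme replicate
```

Each replicate equals the distance correlation of the sample with observation i removed, but no pairwise distance is ever recomputed.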

Consider the crime data of Example 4. The studentized jackknife
replicates $\mathcal{R}_{(i)}/\widehat{se}(\mathcal{R}_{(i)})$, i = 1, ..., n, are plotted in Figure 5(a). These replicates
were computed on the pairs (x, y), where x is the vector (nonwhite, density,
population) and y is crime. The plot suggests that Philadelphia is an unusual
observation. For comparison we plot the first two principal components of
the four variables in Figure 5(b), but Philadelphia (PHIL) does not appear
to be an unusual observation in this plot or other plots (not shown),
including those where log(population) replaces population in the analysis. One can
see from comparing

                 population  nonwhite  density  crime
    PHILADELPHIA       4829      15.7     1359   1753

with sample quartiles

          population  nonwhite   density    crime
    0%        270.00     0.300     37.00   458.00
    25%       398.75     3.400    266.50  2100.25
    50%       664.00     7.300    412.00  2762.00
    75%      1167.75    14.825    773.25  3317.75
    100%    11551.00    64.300  13087.00  5441.00

that crime in Philadelphia is low while population, nonwhite, and density
are all high relative to other cities. Recall that all Pearson correlations were
positive in Example 4.

Fig. 5. Jackknife replicates of dCor (a) and principal components of Freedman data (b)
in Example 5. Horizontal axes: Observation in (a), Principal Component 1 in (b).

This example illustrates that having a single multivariate summary statis- 
tic dCor that measures dependence is a valuable tool in exploratory data 
analysis, and it can provide information about potential influential observa- 
tions prior to model selection. 

Example 6. In this example we illustrate how to isolate the nonlinear 
dependence between random vectors to test for nonlinearity. 

Gumbel's bivariate exponential distribution [14] has density function

$$f(x, y; \theta) = [(1 + \theta x)(1 + \theta y) - \theta] \exp(-x - y - \theta x y), \qquad x, y > 0;\ 0 \le \theta \le 1.$$

The marginal distributions are standard exponential, so there is a strong
nonlinear, but monotone dependence relation between X and Y. The
conditional density is

$$f(y|x) = e^{-(1 + \theta x) y} [(1 + \theta x)(1 + \theta y) - \theta], \qquad y > 0.$$

If θ = 0, then $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ and independence holds, so ρ = 0.
At the opposite extreme, if θ = 1, then ρ = −0.40365 (see Kotz,
Balakrishnan, and Johnson [18], Section 2.2). Simulated data was generated using the
conditional distribution function approach outlined in Johnson [17].
Empirical power of dCov and correlation tests for the case θ = 0.5 are compared
in Figure 6(a), estimated from 10,000 test decisions each for sample sizes
{10:100(10), 120:200(20), 250, 300}. This comparison reveals that the
correlation test is more powerful than dCov against this alternative, which is
not unexpected because $E[Y|X = x] = (1 + \theta + x\theta)/(1 + x\theta)^2$ is monotone.

While we cannot split the dCor or dCov coefficient into linear and nonlin- 
ear components, we can extract correlation first and then compute dCor on 
the residuals. In this way one can separately analyze the linear and nonlinear 
components of bivariate or multivariate dependence relations. 

To extract the linear component of dependence, fit a linear model
$Y = X\beta + \varepsilon$ to the sample (X, Y) by ordinary least squares. It is not necessary to
test whether the linear relation is significant. The residuals $\hat\varepsilon_i = X_i\hat\beta - Y_i$ are
uncorrelated with the predictors X. Apply the dCov test of independence
to $(X, \hat\varepsilon)$.
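The residual step can be sketched in a few lines of Python. This is an illustrative example only: the data are a hypothetical deterministic relation with a linear and a nonmonotone part, an intercept column is added to the design matrix (an assumption of this sketch), and `linear_residuals` and `dcor` are names chosen here.

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation of two samples with observations in rows."""
    def centered(z):
        z = np.asarray(z, dtype=float).reshape(len(z), -1)
        a = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    A, B = centered(x), centered(y)
    return float(np.sqrt(max((A * B).mean(), 0.0) / np.sqrt((A * A).mean() * (B * B).mean())))

def linear_residuals(x, y):
    """Residuals X beta_hat - Y from an ordinary least squares fit with intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat - y

x = np.linspace(-2.0, 2.0, 41)
y = 0.5 * x + x ** 2                        # linear part plus a nonmonotone part
res = linear_residuals(x, y)
print(np.corrcoef(x, res)[0, 1])            # essentially zero: linear component removed
print(dcor(x, res))                         # clearly positive: nonlinear dependence remains
```

The sign convention of the residuals does not affect the result, because the statistic depends on the residuals only through their pairwise distances. A test of significance would then apply a permutation test to x and the residuals, as dcov.test does in the paper's examples.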

Returning to the Gumbel bivariate exponential example, we have
extracted the linear component and applied dCov to the residuals of a simple
linear regression model. Repeating the power comparison described above
on $(X, \hat\varepsilon)$ data, we obtained the power estimates shown in Figure 6(b). One
can note that power of dCov tests is increasing to 1 with sample size,
exhibiting statistical consistency against the nonlinear dependence remaining
in the residuals of the linear model.

Fig. 6. Power comparison of dCov and correlation tests at 10% significance level for
Gumbel's bivariate exponential distribution in Example 6.

This procedure is easily applied in arbitrary dimension. One can fit a 
linear multiple regression model or a model with multivariate response to 
extract the linear component of dependence. This has important practical 
application for evaluating models in higher dimensions. 

More examples, including Monte Carlo power comparisons for random 
vectors in dimensions up to p = q = 30, are given in Szekely et al. [28]. 

6. Summary. Distance covariance and distance correlation are natural 
extensions and generalizations of classical Pearson covariance and correla- 
tion in at least two ways. In one direction we extend the ability to measure 
linear association to all types of dependence relations. In another direction 
we extend the bivariate measure to a single scalar measure of dependence 
between random vectors in arbitrary dimension. In addition to the obvi- 
ous theoretical advantages, we have the practical advantages that the dCov 
and dCor statistics are computationally simple, and applicable in arbitrary 
dimension not constrained by sample size. 

We cannot claim that dCov is the only possible or the only reasonable 
extension with the above mentioned properties, but we can claim that our 
extension is a natural generalization of Pearson's covariance in the follow- 
ing sense. We defined the covariance of random vectors with respect to a 
pair of random processes, and if these random processes are i.i.d. Brownian
motions, which is a very natural choice, then we arrive at the distance co- 
variance; on the other hand, if we choose the simplest nonrandom functions, 
a pair of identity functions (degenerate random processes), then we arrive 
at Pearson's covariance. 

We have illustrated only a few of the many applications where distance 
correlation may provide additional information not measured by classical 
correlation or arrays of bivariate statistics. In exploratory data analysis, dis- 
tance correlation has the flexibility to be applied as a multivariate measure of 
dependence, or measure of dependence among any of the lower dimensional 
marginal distributions. 

The general linear model is fundamental in data analysis for several rea- 
sons, but often a linear model is not adequate. We can test for linearity 
using dCov as shown in Example 6. Although illustrated for simple linear 
regression, the basic method is applicable for all types of i.i.d. observations, 
including longitudinal data or other data with multivariate predictors and/or 
multivariate response. 

In summary, distance correlation is a valuable, practical, and natural tool 
in data analysis and inference that extends the good properties of classical 
correlation to multivariate analysis and the general hypothesis of indepen- 
dence. 

APPENDIX A: PROOFS OF STATEMENTS 

For $\mathbb{R}^d$-valued random variables, $|\cdot|_d$ denotes the Euclidean norm;
whenever the dimension is self-evident we suppress the index d.

A.1. Proof of Theorem 3(iii) and (vi).

Proof. Starting with the left side of the inequality (iii),

$$\begin{aligned}
\mathcal{V}(X_1 + X_2, Y_1 + Y_2)
&= \|f_{X_1+X_2,\,Y_1+Y_2}(t, s) - f_{X_1+X_2}(t)\, f_{Y_1+Y_2}(s)\|\\
&= \|f_{X_1,Y_1}(t, s)\, f_{X_2,Y_2}(t, s) - f_{X_1}(t) f_{X_2}(t) f_{Y_1}(s) f_{Y_2}(s)\|\\
&\le \|f_{X_1,Y_1}(t, s)\,(f_{X_2,Y_2}(t, s) - f_{X_2}(t) f_{Y_2}(s))\| \qquad \text{(A.1)}\\
&\quad + \|f_{X_2}(t) f_{Y_2}(s)\,(f_{X_1,Y_1}(t, s) - f_{X_1}(t) f_{Y_1}(s))\|\\
&\le \|f_{X_2,Y_2}(t, s) - f_{X_2}(t) f_{Y_2}(s)\| + \|f_{X_1,Y_1}(t, s) - f_{X_1}(t) f_{Y_1}(s)\| \qquad \text{(A.2)}\\
&= \mathcal{V}(X_1, Y_1) + \mathcal{V}(X_2, Y_2).
\end{aligned}$$

It is clear that if (a) $X_1$ and $Y_1$ are both constants, (b) $X_2$ and $Y_2$ are
both constants, or (c) $X_1, X_2, Y_1, Y_2$ are mutually independent, then we have
equality in (iii). Now suppose that we have equality in (iii), and thus we have
equality above at (A.1) and (A.2), but neither (a) nor (b) hold. Then the
only way we can have equality at (A.2) is if $X_1, Y_1$ are independent and
also $X_2, Y_2$ are independent. But our hypothesis assumes that $(X_1, Y_1)$ and
$(X_2, Y_2)$ are independent, hence (c) must hold.

Finally, (vi) follows from (iii). In this special case $X_1 = Y_1 = X$ and
$X_2 = Y_2 = Y$. Now (a) means that X is constant, (b) means that Y is constant,
and (c) means that both of them are constants, because this is the only case
when a random variable can be independent of itself. □

A.2. Existence of $\mathcal{W}(X, Y)$. To complete the proof of Theorem 7, we
need to show that all factors in the definition of $\operatorname{Cov}_W(X, Y)$ have finite
fourth moments.

Proof. Note that $E[W^2(t)] = 2|t|$, so that $E[W^4(t)] = 3(E[W^2(t)])^2 = 12|t|^2$
and, therefore,

$$E[W^4(X)] = E[E(W^4(X)|X)] = E[12|X|^2] < \infty.$$

On the other hand, by the inequality $(a + b)^4 \le 2^4(a^4 + b^4)$, and by Jensen's
inequality, we have

$$\begin{aligned}
E(X_W)^4 &= E[W(X) - E(W(X)|W)]^4\\
&\le 2^4\left(E[W^4(X)] + E[E(W(X)|W)]^4\right)\\
&\le 2^5 E[W^4(X)] = 2^5 \cdot 12\, E|X|^2 < \infty.
\end{aligned}$$

Similarly, the random variables $X'_W$, $Y_{W'}$, and $Y'_{W'}$ also have finite fourth
moments, hence,

$$\mathcal{W}^2(X, Y) = E[X_W X'_W Y_{W'} Y'_{W'}] \le \tfrac{1}{4} E[(X_W)^4 + (X'_W)^4 + (Y_{W'})^4 + (Y'_{W'})^4] < \infty.$$

Above we implicitly used the fact that $E[W(X)|W] = \int_{\mathbb{R}^p} W(t)\,dF_X(t)$
exists a.s. This can easily be proved with the help of the Borel-Cantelli
lemma, using the fact that the supremum of centered Gaussian processes
has small tails (see [19, 29]).

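Observe that

$$\begin{aligned}
\mathcal{W}^2(X, Y) &= E[X_W X'_W Y_{W'} Y'_{W'}]\\
&= E[E(X_W X'_W Y_{W'} Y'_{W'} \mid X, X', Y, Y')]\\
&= E[E(X_W X'_W \mid X, X', Y, Y')\, E(Y_{W'} Y'_{W'} \mid X, X', Y, Y')].
\end{aligned}$$

Here

$$\begin{aligned}
X_W X'_W &= \left\{W(X) - \int_{\mathbb{R}^p} W(t)\,dF_X(t)\right\}\left\{W(X') - \int_{\mathbb{R}^p} W(t)\,dF_X(t)\right\}\\
&= W(X)W(X') - \int_{\mathbb{R}^p} W(X)W(t)\,dF_X(t)\\
&\quad - \int_{\mathbb{R}^p} W(X')W(t)\,dF_X(t) + \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} W(t)W(s)\,dF_X(t)\,dF_X(s).
\end{aligned}$$

By the definition of W(·), we have $E[W(t)W(s)] = |t| + |s| - |t - s|$, thus,

$$\begin{aligned}
E[X_W X'_W \mid X, X', Y, Y'] &= |X| + |X'| - |X - X'|\\
&\quad - \int_{\mathbb{R}^p} (|X| + |t| - |X - t|)\,dF_X(t)\\
&\quad - \int_{\mathbb{R}^p} (|X'| + |t| - |X' - t|)\,dF_X(t)\\
&\quad + \int_{\mathbb{R}^p}\int_{\mathbb{R}^p} (|t| + |s| - |t - s|)\,dF_X(t)\,dF_X(s).
\end{aligned}$$

Hence,

$$\begin{aligned}
E[X_W X'_W \mid X, X', Y, Y'] &= |X| + |X'| - |X - X'|\\
&\quad - (|X| + E|X| - E'|X - X'|)\\
&\quad - (|X'| + E|X| - E''|X' - X''|)\\
&\quad + (E|X| + E|X'| - E|X - X'|)\\
&= E'|X - X'| + E''|X' - X''| - |X - X'| - E|X - X'|,
\end{aligned}$$

where E' denotes the expectation with respect to X' and E'' denotes the
expectation with respect to X''. A similar argument for Y completes the
proof. □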

APPENDIX B: CRITICAL VALUES 

Estimated critical values for $n\mathcal{R}_n^2$(rank(X), rank(Y)) are summarized in
Table 2 for 5% and 10% significance levels. The critical values are estimates
of the 95th and 90th quantiles of the sampling distribution and were obtained
by a large scale Monte Carlo simulation (100,000 replicates for each n). For
sample sizes n ≤ 10, the probabilities were determined by generating all
possible permutations of the ranks, so the achieved significance levels (ASL)
reported for n ≤ 10 are exact. The rejection region is in the upper tail.
Table 2

Critical values of $n\mathcal{R}_n^2$(rank(X), rank(Y)); exact achieved significance level (ASL) for
n ≤ 10, and Monte Carlo estimates for n ≥ 11. Reject independence if $n\mathcal{R}_n^2$ is greater
than or equal to the table value

  n   10% (ASL)       5% (ASL)          n   10%    5%        n   10%    5%
  5   3.685 (0.100)   4.211 (0.050)    15   4.25   5.16     25   4.26   5.22
  6   3.917 (0.097)   4.699 (0.047)    16   4.25   5.17     30   4.25   5.22
  7   4.215 (0.098)   4.858 (0.047)    17   4.25   5.17     35   4.24   5.23
  8   4.233 (0.099)   4.995 (0.050)    18   4.25   5.18     40   4.24   5.23
  9   4.208 (0.100)   5.072 (0.050)    19   4.25   5.20     50   4.24   5.24
 10   4.221 (0.100)   5.047 (0.050)    20   4.25   5.20     60   4.24   5.25
 11   4.23            5.07             21   4.26   5.21     70   4.24   5.26
 12   4.24            5.10             22   4.26   5.21     80   4.24   5.26
 13   4.25            5.14             23   4.26   5.21     90   4.24   5.26
 14   4.25            5.16             24   4.26   5.22    100   4.24   5.26